On-Line Learning Theory of Soft Committee Machines 
with Correlated Hidden Units 
— Steepest Gradient Descent and Natural Gradient Descent — 
The permutation symmetry of the hidden units in multilayer perceptrons causes the saddle
structure and plateaus of the learning dynamics in gradient learning methods. The correlation of the
weight vectors of hidden units in a teacher network is thought to affect this saddle structure,
resulting in a prolonged learning time, but the mechanism is still unclear. In this paper, we discuss it
with regard to soft committee machines and on-line learning using statistical mechanics. Conventional
gradient descent needs more time to break the symmetry as the correlation of the teacher weight
vectors rises. On the other hand, no plateaus occur with natural gradient descent regardless of the
correlation in the limit of a low learning rate. Analytical results support these dynamics around
the saddle point.

PACS numbers: 07.05.Mh, 05.90.+m



I. INTRODUCTION 



One of the biggest problems of neural network learning
is the plateau of the learning curve. In gradient learning
methods, this plateau of the generalization error is
mainly caused by the saddle structure of the
error function. The permutation symmetry prevents the
identification of the hidden units in multilayer perceptrons
if they have the same weight vectors, and produces
this saddle structure. In the learning scenario of a
teacher and a student network, the saddle is thought to
be affected by the strength of the correlation of the hidden
units in the teacher network, which may be closely
related to the length of the plateau. More specifically,
in conventional gradient descent (GD), the weight
vectors in the student network are known to approach
the saddle before reaching their final states. Since
the saddle is located between the weight vectors of the
teacher hidden units, their stronger correlation is supposed
to force the student weight vectors closer to the
saddle, resulting in a longer plateau.

Natural gradient descent (NGD), however, may be able
to avoid the saddle because it updates the network
parameters in the optimal direction in the Riemannian
space. NGD is a fairly general method for effectively
adjusting the parameters of stochastic models, but its
validity in multilayer perceptrons is uncertain because of
three intrinsic problems: 1) NGD needs prior knowledge
of the input distribution to calculate the Fisher information
matrix, 2) NGD is unstable around the singular
points of the Fisher information matrix, and 3) matrix
inversion is time consuming, which might be critical especially
in real-time learning. The method proposed by Yang and
Amari can be used to calculate NGD efficiently in the
case of a large input dimension in multilayer perceptrons.



Also, the adaptive method can be used to approximate
the inverse of the Fisher information matrix asymptotically
without prior knowledge or matrix inversion. In
this paper, we discuss the problem of singularity; since
the saddle is one of the singular points, how NGD works
around it is one of our main topics.

On-line learning is one of the most popular forms of
training. Analysis of the network dynamics in on-line
learning is much easier than for batch learning because
the state of the network and the learning samples are
independent of each other. In this framework, the statistical
mechanics method proposed by Saad and Solla can
be used to analyze the GD dynamics exactly in the limit
of a large input dimension. Rattray and Saad extended
this technique to NGD and reported that it works
efficiently in multilayer perceptrons. In this paper, we
also use this method and contrast the dynamics for GD
and NGD, focusing on the corrupted saddle structure under
a strong correlation of the hidden units in the teacher
network.



II. MODEL 

Soft committee machines (Fig. 1) are considered, where
the teacher network has $M$ hidden units while the student
has $K$ units. To apply NGD, Gaussian noise
$n \sim \mathcal{N}(0, \sigma^2)$ is added to the output of the student:

\[ f_B(\xi) = \sum_{i=1}^{M} g(B_i^T \xi), \tag{1} \]
\[ \zeta = f_J(\xi) + n, \qquad f_J(\xi) = \sum_{i=1}^{K} g(J_i^T \xi), \tag{2} \]

where $\xi \in \mathbb{R}^N$ denotes the input vector while $B_i \in \mathbb{R}^N$
and $J_i \in \mathbb{R}^N$ are the $i$th weight vectors of the teacher
and the student networks, respectively. Here, $T$ denotes
transposition and $g$ is an activation function.

FIG. 1: Teacher and student networks. Each weight between
any hidden unit and the output is fixed to 1.

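The joint probability distribution of the input $\xi$ and
the output $\zeta$ of the student network is given by

\[ p_J(\xi, \zeta) = p(\xi)\, p_J(\zeta \mid \xi), \tag{3} \]
\[ p_J(\zeta \mid \xi) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\{\zeta - f_J(\xi)\}^2}{2\sigma^2} \right). \tag{4} \]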

The parameter vector of the student network,
$J = [J_1^T, \ldots, J_K^T]^T \in \mathbb{R}^{KN}$, is updated iteratively to approximate the joint
probability distribution of the input $\xi$ and the output $\zeta$
of the teacher network,

\[ p(\xi, \zeta) = p(\xi)\, \delta(\zeta - f_B(\xi)), \tag{5} \]

where $\delta$ is the delta function. The loss function for a
given learning sample $\{\xi, \zeta\}$, defined using the
logarithmic loss of the conditional probability distribution
(4), is

\[ e_J(\xi, \zeta) = -\ln p_J(\zeta \mid \xi) + c_0 = \frac{1}{2\sigma^2}\{\zeta - f_J(\xi)\}^2, \tag{6} \]

where $c_0 = -\ln\sqrt{2\pi\sigma^2}$ is constant. The generalization
error is then defined as the expected loss:

\[ e_g(J) = \langle e_J(\xi, \zeta) \rangle_{\{\xi,\zeta\}}. \tag{7} \]

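Because the teacher output in (5) is deterministic, $\zeta = f_B(\xi)$,
the loss (6) can be written as

\[ e_J(\xi, \zeta) = e_J(\xi) = \frac{1}{2\sigma^2}\{f_B(\xi) - f_J(\xi)\}^2. \tag{8} \]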

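We consider on-line learning in this paper, where the
parameter vector $J$ is updated for each independently
given sample $\{\xi, \zeta\}$. The updating rule, that is, the
increment of $J$, for GD is defined with a learning rate $\eta$
as

\[ \Delta J = -\frac{\eta}{N} \nabla_J e_J(\xi, \zeta), \tag{9} \]

where

\[ \nabla_{J_i} e_J(\xi, \zeta) = -\frac{1}{\sigma^2}\, g'(J_i^T \xi)\{f_B(\xi) - f_J(\xi)\}\,\xi, \tag{10} \]

and $g'$ denotes the derivative of $g$. The rule for NGD is
defined as

\[ \Delta J = -\frac{\eta}{N} G^{-1} \nabla_J e_J(\xi, \zeta), \tag{11} \]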

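where $G$ denotes the Fisher information matrix of the
parameter vector $J$:

\[ G = \left\langle [\nabla_J \ln p_J(\xi, \zeta)][\nabla_J \ln p_J(\xi, \zeta)]^T \right\rangle_{\{\xi,\zeta\}}. \tag{12} \]

$G$ can be written, in block form, as

\[ G = \begin{bmatrix} G_{1,1} & \cdots & G_{1,K} \\ \vdots & \ddots & \vdots \\ G_{K,1} & \cdots & G_{K,K} \end{bmatrix}, \qquad
G_{i,j} = \frac{1}{\sigma^2} \left\langle g'(J_i^T \xi)\, g'(J_j^T \xi)\, \xi \xi^T \right\rangle_{\{\xi\}}. \tag{13} \]

In the case of standard multivariate normal input,
$\xi \sim \mathcal{N}(0, I)$, each block of the inverse of the Fisher
information matrix is given by

\[ (G^{-1})_{i,j} = \sigma^2 \{\theta_{ij} I + J' \Theta_{ij} J'^T\}, \tag{14} \]

where $J' = [J_1, \ldots, J_K]$ is an $N$ by $K$ matrix, while $\theta_{ij}$ is
a scalar and $\Theta_{ij}$ is a $K$ by $K$ matrix.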
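
As an illustration of the GD rule (9)-(10), the following minimal Python sketch performs one on-line update of a soft committee machine. It assumes the activation $g(x) = \mathrm{erf}(x/\sqrt{2})$ that is adopted later in Sec. III; the helper names (`committee_output`, `gd_step`) and the values used in the demonstration are illustrative and do not appear in the paper.

```python
import math
import random


def g(x):
    """Activation g(x) = erf(x / sqrt(2)), assumed from Sec. III."""
    return math.erf(x / math.sqrt(2.0))


def g_prime(x):
    """Derivative of g: sqrt(2 / pi) * exp(-x^2 / 2)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def committee_output(weights, xi):
    """Sum of hidden-unit activations; hidden-to-output weights are fixed to 1."""
    return sum(g(dot(w, xi)) for w in weights)


def gd_step(J, B, xi, eta, sigma2):
    """Return the student weights after one GD step on the sample xi."""
    n = len(xi)
    err = committee_output(B, xi) - committee_output(J, xi)  # f_B - f_J
    updated = []
    for J_i in J:
        # Delta J_i = (eta / (N sigma^2)) g'(J_i . xi) (f_B - f_J) xi
        scale = eta / (n * sigma2) * g_prime(dot(J_i, xi)) * err
        updated.append([w + scale * x for w, x in zip(J_i, xi)])
    return updated


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed, for reproducibility only
    N = 50
    B = [[rng.gauss(0, 1 / math.sqrt(N)) for _ in range(N)] for _ in range(2)]
    J = [[rng.gauss(0, 1 / math.sqrt(N)) for _ in range(N)] for _ in range(2)]
    xi = [rng.gauss(0, 1) for _ in range(N)]
    J_new = gd_step(J, B, xi, eta=0.1, sigma2=1.0)
    print(abs(committee_output(B, xi) - committee_output(J_new, xi)))
```

When the student already equals the teacher, the error term vanishes and the step leaves the weights unchanged, which is the fixed point the learning aims at.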

III. THEORY 
A. Order parameters and generalization error 

In the thermodynamic limit, $N \to \infty$, the
dynamics of the network can be analyzed using statistical
mechanics. Here, the order parameters that represent
the correlations of the weight vectors are used instead of
the $N$-dimensional vectors $\xi$, $B_i$, and $J_i$. To make the
present paper self-contained, we briefly summarize the
derivation of the order parameter equations of the soft
committee machine.

From here on, the input vector is assumed to obey
an $N$-dimensional multivariate Gaussian distribution with zero
mean and a unit covariance matrix: $\xi \sim \mathcal{N}(0, I)$. The
correlations between the input and each weight vector,
denoted by $x_i = J_i^T \xi$ and $y_i = B_i^T \xi$, are then
normally distributed, $x_i \sim \mathcal{N}(0, J_i^T J_i)$
and $y_i \sim \mathcal{N}(0, B_i^T B_i)$, while their covariances
are given by $\langle x_i x_j \rangle_{\{\xi\}} = J_i^T J_j$, $\langle x_i y_j \rangle_{\{\xi\}} = J_i^T B_j$, and
$\langle y_i y_j \rangle_{\{\xi\}} = B_i^T B_j$. Therefore, a new vector, defined as

\[ z = [x_1, \ldots, x_K, y_1, \ldots, y_M]^T, \tag{15} \]

is distributed as a multivariate normal distribution
$\mathcal{N}(0, C)$:



\[ p(z) = \frac{1}{\sqrt{(2\pi)^{K+M} |C|}} \exp\left( -\frac{1}{2} z^T C^{-1} z \right), \tag{16} \]

where $C$ is the variance-covariance matrix:

\[ C = \begin{bmatrix} Q & R \\ R^T & T \end{bmatrix}, \tag{17} \]

with

\[ Q = J'^T J', \tag{18} \]
\[ R = J'^T B', \tag{19} \]
\[ T = B'^T B', \tag{20} \]

and $J' = [J_1 \cdots J_K]$, $B' = [B_1 \cdots B_M]$. Here, $Q$ and $R$
are the order parameters of this system.
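
The definitions (17)-(20) can be sketched in a few lines of Python. The example below computes $Q$, $R$, and $T$ from explicit weight vectors and assembles $C$; the function names and the small three-dimensional vectors in the demonstration are illustrative assumptions, chosen only to make the block structure visible.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def order_parameters(J, B):
    """Return Q = J'^T J', R = J'^T B', T = B'^T B' as nested lists."""
    Q = [[dot(J_i, J_j) for J_j in J] for J_i in J]
    R = [[dot(J_i, B_n) for B_n in B] for J_i in J]
    T = [[dot(B_n, B_m) for B_m in B] for B_n in B]
    return Q, R, T


def covariance_matrix(Q, R, T):
    """Assemble C = [[Q, R], [R^T, T]], the covariance of z = [x; y]."""
    K, M = len(Q), len(T)
    top = [Q[i] + R[i] for i in range(K)]
    bottom = [[R[i][n] for i in range(K)] + T[n] for n in range(M)]
    return top + bottom


if __name__ == "__main__":
    kappa = math.acos(0.75)  # teacher angle giving T_12 = 0.75
    B = [[1.0, 0.0, 0.0], [math.cos(kappa), math.sin(kappa), 0.0]]
    J = [[0.1, 0.0, 1.0], [0.0, 0.1, -1.0]]
    Q, R, T = order_parameters(J, B)
    for row in covariance_matrix(Q, R, T):
        print(row)
```

The point of the construction is that the $(K+M) \times (K+M)$ matrix $C$ replaces the $N$-dimensional vectors in every average that follows.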

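Using these order parameters, the generalization error
$e_g(J) = \langle e_J(\xi, \zeta) \rangle_{\{\xi,\zeta\}}$ can be calculated by

\[ e_g(J) = \int dz\, p(z)\, \frac{1}{2\sigma^2} \left\{ \sum_{k=1}^{M} g(y_k) - \sum_{k=1}^{K} g(x_k) \right\}^2. \tag{21} \]

If we define the activation function $g$ as $g(x) = \mathrm{erf}(x/\sqrt{2})$
from here on, the Gaussian average
$\langle g(u) g(v) \rangle = \frac{2}{\pi} \arcsin \frac{\langle uv \rangle}{\sqrt{(1 + \langle u^2 \rangle)(1 + \langle v^2 \rangle)}}$
applies to every pair of components of $z$, and the generalization error is given by

\[ e_g(J) = \frac{1}{\pi\sigma^2} \left\{ \sum_{i,j=1}^{K} \arcsin \frac{Q_{ij}}{\sqrt{(1+Q_{ii})(1+Q_{jj})}}
+ \sum_{n,m=1}^{M} \arcsin \frac{T_{nm}}{\sqrt{(1+T_{nn})(1+T_{mm})}}
- 2 \sum_{i=1}^{K} \sum_{n=1}^{M} \arcsin \frac{R_{in}}{\sqrt{(1+Q_{ii})(1+T_{nn})}} \right\}, \tag{22} \]

which depends on only the order parameters.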
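
The dependence of the generalization error on the order parameters alone can be made concrete with a short sketch. It assumes the standard Gaussian average for the erf activation, $\langle g(u) g(v) \rangle = (2/\pi) \arcsin[\langle uv \rangle / \sqrt{(1+\langle u^2 \rangle)(1+\langle v^2 \rangle)}]$, together with the prefactor $1/(2\sigma^2)$ of the loss (8); the function name and the numerical values of the demonstration are illustrative.

```python
import math


def generalization_error(Q, R, T, sigma2):
    """Closed-form e_g for g(x) = erf(x / sqrt(2)) from the order parameters."""
    K, M = len(Q), len(T)

    def term(cov, var_a, var_b):
        return math.asin(cov / math.sqrt((1.0 + var_a) * (1.0 + var_b)))

    total = 0.0
    for i in range(K):          # student-student correlations
        for j in range(K):
            total += term(Q[i][j], Q[i][i], Q[j][j])
    for n in range(M):          # teacher-teacher correlations
        for m in range(M):
            total += term(T[n][m], T[n][n], T[m][m])
    for i in range(K):          # student-teacher correlations
        for n in range(M):
            total -= 2.0 * term(R[i][n], Q[i][i], T[n][n])
    return total / (math.pi * sigma2)


if __name__ == "__main__":
    T = [[1.0, 0.75], [0.75, 1.0]]
    Q = [[1.0, 0.0], [0.0, 1.0]]
    R = [[0.01, 0.0], [0.0, 0.01]]
    print(generalization_error(Q, R, T, sigma2=1.0))  # untrained student
    print(generalization_error(T, T, T, sigma2=1.0))  # student equal to teacher
```

A student that reproduces the teacher ($Q = R = T$) gives zero error, which is a convenient sanity check for any implementation of this expression.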

B. Dynamics of the order parameters 

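Here we replace the dynamics of the system with those of the
order parameters. First, by using (10), the
updating rule (9) can be written as

\[ \Delta J_i = \frac{\eta}{N\sigma^2}\, \delta_i\, \xi, \tag{23} \]

where

\[ \delta_i = g'(x_i) \left\{ \sum_{k=1}^{M} g(y_k) - \sum_{k=1}^{K} g(x_k) \right\}. \tag{24} \]

Thus, the updating rule of the order parameters is given
by

\[ \Delta R_{ij} = [J_i + \Delta J_i]^T B_j - J_i^T B_j = \frac{\eta}{N\sigma^2}\, \delta_i y_j \tag{25} \]

and

\[ \Delta Q_{ij} = [J_i + \Delta J_i]^T [J_j + \Delta J_j] - J_i^T J_j
= \frac{\eta}{N\sigma^2} \{\delta_i x_j + \delta_j x_i\} + \frac{\eta^2}{N^2\sigma^4}\, \delta_i \delta_j\, \xi^T \xi. \tag{26} \]

Here we introduce the time $\alpha$; a short period, $\Delta\alpha = 1/N$,
is taken to elapse for each learning iteration.
In the limit of large $N$, the differential dynamics
of $R_{ij}$ and $Q_{ij}$ are calculated as

\[ \frac{dR_{ij}}{d\alpha} = \lim_{\Delta\alpha \to 0} \frac{\Delta R_{ij}}{\Delta\alpha} = \lim_{N \to \infty} N \Delta R_{ij}
= \frac{\eta}{\sigma^2} \langle \delta_i y_j \rangle_{\{z\}} = \frac{\eta}{\sigma^2}\, \phi_{ij} \tag{27} \]

and

\[ \frac{dQ_{ij}}{d\alpha} = \frac{\eta}{\sigma^2} \{\psi_{ij} + \psi_{ji}\} + \frac{\eta^2}{\sigma^4}\, \upsilon_{ij}, \tag{28} \]

where $\xi^T \xi \to N$ is applied, while the new variables are
defined as

\[ \phi_{ij} = \langle \delta_i y_j \rangle_{\{z\}}, \quad \psi_{ij} = \langle \delta_i x_j \rangle_{\{z\}}, \quad \upsilon_{ij} = \langle \delta_i \delta_j \rangle_{\{z\}}. \tag{29} \]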
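
The averages $\phi_{ij} = \langle \delta_i y_j \rangle$, $\psi_{ij} = \langle \delta_i x_j \rangle$, and $\upsilon_{ij} = \langle \delta_i \delta_j \rangle$ of (29) can be checked numerically. The sketch below is not the analytical evaluation used in the paper: it simply estimates the three averages by Monte Carlo sampling of Gaussian inputs for given weight vectors, assuming $g(x) = \mathrm{erf}(x/\sqrt{2})$. The function name, sample count, and seed are illustrative.

```python
import math
import random


def g(x):
    return math.erf(x / math.sqrt(2.0))


def g_prime(x):
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def estimate_averages(J, B, n_samples, rng):
    """Monte Carlo estimates of phi (K x M), psi (K x K), upsilon (K x K)."""
    K, M, N = len(J), len(B), len(J[0])
    phi = [[0.0] * M for _ in range(K)]
    psi = [[0.0] * K for _ in range(K)]
    ups = [[0.0] * K for _ in range(K)]
    for _ in range(n_samples):
        xi = [rng.gauss(0.0, 1.0) for _ in range(N)]
        x = [dot(J_i, xi) for J_i in J]
        y = [dot(B_n, xi) for B_n in B]
        err = sum(g(v) for v in y) - sum(g(v) for v in x)
        delta = [g_prime(x_i) * err for x_i in x]  # Eq. (24)
        for i in range(K):
            for n in range(M):
                phi[i][n] += delta[i] * y[n] / n_samples
            for j in range(K):
                psi[i][j] += delta[i] * x[j] / n_samples
                ups[i][j] += delta[i] * delta[j] / n_samples
    return phi, psi, ups


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed
    B = [[1.0, 0.0, 0.0], [0.75, math.sqrt(1.0 - 0.75 ** 2), 0.0]]
    J = [[0.01, 0.0, 1.0], [0.0, 0.01, -1.0]]
    phi, psi, ups = estimate_averages(J, B, 2000, rng)
    print(phi)
```

Such sampled values are noisy, so they serve only as a cross-check of closed-form expressions, not as a replacement for them.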
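The dynamics for NGD can be derived in the same way.
With the inverse of the Fisher information matrix (14), the updating rule (11) becomes

\[ \Delta J_i = \frac{\eta}{N} \sum_{k=1}^{K} \delta_k \{\theta_{ik} I + J' \Theta_{ik} J'^T\}\, \xi. \tag{30} \]

Thus, the dynamics of the order parameters are

\[ \frac{dR_{ij}}{d\alpha} = \eta \sum_{k=1}^{K} \{\theta_{ik} \phi_{kj} + \psi_{k\cdot} \Theta_{ik} R_{\cdot j}\}, \tag{31} \]

\[ \frac{dQ_{ij}}{d\alpha} = \eta \sum_{k=1}^{K} \{\theta_{ik} \psi_{kj} + \theta_{jk} \psi_{ki} + \psi_{k\cdot} \Theta_{ik} Q_{\cdot j} + \psi_{k\cdot} \Theta_{jk} Q_{\cdot i}\}
+ \eta^2 \sum_{k,l=1}^{K} \theta_{ik} \theta_{jl} \upsilon_{kl}, \tag{32} \]

where $\psi_{k\cdot}$ denotes the $k$th row of the matrix
$\{\psi_{ij}\}_{i,j=1,\ldots,K}$, while $R_{\cdot j}$ denotes the $j$th column of the
matrix $R$, and so on.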
FIG. 2: Time evolution of the generalization error in GD (a) and NGD (b). All the trajectories almost completely overlap
in (b). The plateau periods in (a) were measured and are shown in (c). MCS denotes the number of Monte Carlo steps.

FIG. 3: Time evolution of the order parameters $(R_{1,1}, R_{1,2})$ and $(R_{2,1}, R_{2,2})$ in GD (a) and NGD (b). The correlation of the
teacher weight vectors is $T_{1,2} = 0.75$. Start points: □; turning points: △; the saddle: ○; and goals: ◇.



IV. NUMERICAL RESULTS 

In this section, we discuss how the learning dynamics
depend on the correlation of the teacher weight vectors, $T_{1,2}$.
The results are also contrasted between GD and NGD.

We set the number of the hidden units and the lengths
of the teacher weight vectors as follows:

\[ K = M = 2, \quad T_{1,1} = T_{2,2}. \tag{33} \]

We also restrict the initial conditions to

\[ Q_{2,2} = Q_{1,1}, \quad R_{2,2} = R_{1,1}, \quad R_{1,2} = R_{2,1}. \tag{34} \]

Because of the symmetry of the system, these restrictions
are preserved throughout the learning. Therefore, we have
four free parameters, $Q_{1,1}$, $Q_{1,2}$, $R_{1,1}$,
and $R_{1,2}$, in this system. Note that $Q$ and $T$ are always
symmetric matrices from the definitions (18) and (20).
Specifically, we set the initial values to

\[ Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad R = \begin{bmatrix} 10^{-2} & 0 \\ 0 & 10^{-2} \end{bmatrix}. \tag{35} \]

Other parameters are set as $T_{1,1} = 1$, $\eta = 10^{\ldots}$, and
$\sigma^2 = 5 \times 10^{-\ldots}$. Various values for $T_{1,2}$ are employed to
examine the influence of the correlation of the teacher
hidden units. We sometimes use $\kappa = \arccos(T_{1,2}/T_{1,1})$, the
angle between the teacher weight vectors, instead of $T_{1,2}$.

In this case, $\theta_{ij}$ and $\Theta_{ij}$ in the inverse of the Fisher
information matrix (14) can be simplified as



©1,1 

02,2 
01,2 



2,2 = cVa, Oi,2 = 6*2,1 = -cVb, 
2a{a-b}-bQii bQi2 
bQi,2 -bQiA ' 



d\/a 

02,1 



foQi,2 

bQi.2 2a{a — b} — bQi,i 



(36) 
FIG. 4: Time evolution of the generalization error for the case of $N = 1000$ in GD (a) and NGD (b).

FIG. 5: Time evolution of the order parameters $(R_{1,1}, R_{1,2})$ and $(R_{2,1}, R_{2,2})$ for the case of $N = 1000$ in GD (a) and NGD (b),
as in Fig. 3.



-dVb 



a{Qis + l}-b^ 



aQi.2 
a{(3i.i + l}-&^ 



where 



{Qi.i + ip-QL, 
d. 



2 a-b ' 



6 = 2Qi,i- 



1, 



(37) 



Here we summarize the order of each variable with respect to $N$.
Since the length of the input vector $\xi$ is $O(\sqrt{N})$, $x_i$ and
$y_i$ are $O(1)$. This guarantees that the arguments of the
activation function $g$ are $O(1)$. Therefore, the lengths
of the weight vectors, $\sqrt{Q_{ii}}$ and $\sqrt{T_{ii}}$, are $O(1)$. If the
direction of the initial $J_i$ is chosen randomly, the size
of $R_{ii}$, the correlation between $J_i$ and $B_i$, is $O(1/\sqrt{N})$.
The initial numerical values in (35) are defined according
to these sizes.

Figure 2 shows the time evolution of the generalization
error. In GD (Fig. 2a), the plateau was greatly
prolonged as the correlation of the teacher weight vectors
rose. In NGD (Fig. 2b), almost no plateau occurred at
any $T_{1,2}$ if $\eta$ was set small enough relative to the initial
values, and the generalization error decreased exponentially.
The plateau periods of Fig. 2a were measured
and are shown in Fig. 2c, where we defined a plateau as
occurring if $d \ln e_g / d\alpha > -0.0005$. The order of the plateau
lengths was about $O(\kappa^{-\ldots})$ in GD.
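
The plateau criterion above is easy to apply to a sampled learning curve. The sketch below is one possible way to do so, assuming the derivative is taken with respect to the time $\alpha$ and approximated by finite differences between consecutive samples; the function name and the synthetic curve are illustrative and are not data from the paper.

```python
import math


def plateau_length(alphas, errors, threshold=-0.0005):
    """Total time alpha during which d ln(e_g) / d alpha exceeds the threshold.

    The derivative is approximated by finite differences between samples.
    """
    total = 0.0
    for k in range(1, len(alphas)):
        d_alpha = alphas[k] - alphas[k - 1]
        slope = (math.log(errors[k]) - math.log(errors[k - 1])) / d_alpha
        if slope > threshold:
            total += d_alpha
    return total


if __name__ == "__main__":
    # Synthetic curve: exponential decay up to alpha = 100, then a flat stretch.
    alphas = [float(a) for a in range(151)]
    errors = [math.exp(-0.01 * min(a, 100.0)) for a in alphas]
    print(plateau_length(alphas, errors))
```

On noisy simulation curves the finite differences would likely need smoothing before thresholding; that step is omitted here.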

Figure 3 shows the trajectories of the order parameters
$(R_{1,1}, R_{1,2})$ and $(R_{2,1}, R_{2,2})$. Because of the symmetry,
the latter plots are mirror images of the former. As $R_{1,1}$
is the correlation between the first student and the corresponding
teacher, the initial value is almost 0 and the
goal is 1; $R_{1,2}$ is the correlation between the first student
and the non-corresponding teacher, and the initial value
is almost 0 and the goal is $T_{1,2}$. Therefore, the target
locations of the plots are $(1, T_{1,2})$ and $(T_{1,2}, 1)$, respectively
(shown as ◇). The other order parameters, $Q_{1,1}$ and $Q_{1,2}$,
are not shown. In the case of GD (Fig. 3a), the plots
start at □, turn back at △, then approach ○ (the saddle,
as explained in the next section), and finally reach
◇. Actually, the parameters never pass through the same
place again because $Q_{1,1}$ and $Q_{1,2}$ are updated. In the
case of NGD (Fig. 3b), the plots start at □ and reach ◇
while avoiding ○.

We performed a numerical simulation to confirm the
dynamics obtained in the above thermodynamic limit. The input
dimension was $N = 1000$, the teacher weight vectors were
set as

\[ B_1 = [1, 0, 0, \ldots, 0]^T, \quad B_2 = [\cos\kappa, \sin\kappa, 0, \ldots, 0]^T, \tag{38} \]

and every element of each initial $J_i$ was randomly and independently
chosen from $\mathcal{N}(0, 1/N)$ for each trial. Thus, the order
parameters $Q$ and $R$ were no longer limited by the restriction of
(34). The learning was performed using these real weight
vectors and the original equations: (9) for GD and (11)
for NGD. Figures 4 and 5 show the time evolution of
the generalization error and the trajectories of the order
parameters in the same manner as Figs. 2 and 3,
respectively. Both figures support the statistical dynamics
well, which suggests that the constraint of (34) is a rather
minor problem and the system retains most of its generality
even with that restriction.
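
The setup of this finite-size simulation can be sketched as follows. The code builds the two unit-length teacher vectors separated by the angle $\kappa$ and draws initial student vectors with independent $\mathcal{N}(0, 1/N)$ elements, then prints the resulting order parameters. The function names and the seed are illustrative assumptions; the learning loop itself is not included.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def teacher_vectors(n, kappa):
    """Unit-length teacher vectors separated by the angle kappa."""
    b1 = [0.0] * n
    b2 = [0.0] * n
    b1[0] = 1.0
    b2[0], b2[1] = math.cos(kappa), math.sin(kappa)
    return [b1, b2]


def random_students(n, k, rng):
    """Initial student vectors with i.i.d. N(0, 1/n) elements."""
    std = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed, for reproducibility only
    N, kappa = 1000, math.acos(0.75)
    B = teacher_vectors(N, kappa)
    J = random_students(N, 2, rng)
    print("T12 =", dot(B[0], B[1]))
    print("Q11 =", dot(J[0], J[0]), "R11 =", dot(J[0], B[0]))
```

With this initialization $Q_{1,1}$ is close to 1 and $R_{1,1}$ is of order $1/\sqrt{N}$, in line with the orders of magnitude summarized earlier in this section.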



FIG. 6: The student weight vectors $J_1$ and $J_2$ belong to the
plane made by the teacher weight vectors $B_1$ and $B_2$.

V. SADDLE

Here, we discuss why NGD is so effective even with
a strong correlation between teacher hidden units. We
consider the dynamics around the saddle of the generalization
error under the conditions of (33) and (34). This
point, where all the time derivatives of the order parameters
are zero and the Hessian matrix is neither positive definite
nor negative definite, is shown as ○ in Figs. 3 and 5:

\[ Q_{1,1} = Q_{1,2} = \frac{T_{1,1} + T_{1,2}}{T_{1,1} - T_{1,2} + 2}, \qquad
R_{1,1} = R_{1,2} = \frac{T_{1,1} + T_{1,2}}{\sqrt{2\{T_{1,1} - T_{1,2} + 2\}}}. \tag{39} \]

This saddle is a special point because 1) it corresponds
to the goal both in the case of $T_{1,1} = T_{1,2}$ (the teacher
is a smaller network: $f_B(\xi) = 2g(B_1^T \xi)$) and in the case
that the student is a smaller network: $f_J(\xi) = 2g(J_1^T \xi)$,
2) in GD, the plateau occurs around it, and in NGD the
student vectors avoid it, and 3) it coincides with one of the
singular points of the Fisher information matrix since
$Q_{1,1} = Q_{1,2}$. We simplify the situation as shown in Fig. 6:
the two student weight vectors belong to the plane
made by the two teacher weight vectors. This simplification
is useful because we are now interested in how fast
the student vectors leave this point for the goals. The
correlations are re-parameterized by $\kappa$ and $\lambda$ as

\[ T_{1,2} = \cos\kappa, \quad Q_{1,2} = Q_{1,1} \cos\lambda, \quad
R_{1,1} = \sqrt{Q_{1,1} T_{1,1}} \cos\frac{\kappa - \lambda}{2}, \quad
R_{1,2} = \sqrt{Q_{1,1} T_{1,1}} \cos\frac{\kappa + \lambda}{2}. \tag{40} \]

Now, we have only two free parameters, $Q_{1,1}$ and $\lambda$. Since
the first derivative of $\lambda$ can be written with $Q_{1,1}$ and $Q_{1,2}$
as

\[ \frac{d\lambda}{d\alpha} = \frac{1}{\sqrt{Q_{1,1}^2 - Q_{1,2}^2}} \left\{ \frac{Q_{1,2}}{Q_{1,1}} \frac{dQ_{1,1}}{d\alpha} - \frac{dQ_{1,2}}{d\alpha} \right\}, \tag{41} \]

we can formulate the angular velocity of $\lambda$ at $0 < \lambda \ll 1$.
The term of $\eta^2$ included in $dQ_{ij}/d\alpha$ can be ignored if the
learning rate $\eta$ is set small enough.
The angular velocity for GD is

\[ \frac{d\lambda}{d\alpha} = c_1 \lambda \sin\kappa, \tag{42} \]



where ci = ^Ti,i{ri,i{l-cos4-|-2}-^{Ti,i{3-fcosK}-|- 
2}^5. We notice that the order of ci is not greatly 
changed by $\kappa$. The velocity converges to zero in the
first order of $\lambda$. Moreover, it decreases as $\kappa$ decreases.
Therefore, this equation supports the simulation results
showing that the plateau is prolonged as the teacher correlation
rises. The angular velocity for NGD is

\[ \frac{d\lambda}{d\alpha} = c_2\, \frac{1}{\lambda} \tan\frac{\kappa}{2}, \tag{43} \]

where $c_2 = 2\eta$. This velocity diverges to infinity as $\lambda$ goes
to zero. Although it decreases as $\kappa$ decreases, this effect
would be canceled by $\lambda^{-1}$ near the saddle. Therefore,
this equation means that the student weight vectors are
repelled by the saddle. In addition, this also supports
the simulation results showing that the student weight
vectors avoid the saddle and that the plateau does not
occur even in the case of strongly correlated teacher hidden
units.
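
The contrast between the two angular velocities can be illustrated by integrating them in closed form. The sketch below assumes the near-saddle forms $d\lambda/d\alpha = c_1 \lambda \sin\kappa$ for GD and $d\lambda/d\alpha = c_2 \lambda^{-1} \tan(\kappa/2)$ for NGD, treats $c_1$ and $c_2$ as given constants, and extends the small-$\lambda$ expressions up to a finite $\lambda_1$ purely for illustration. The function names and numerical values are hypothetical.

```python
import math


def gd_escape_time(lam0, lam1, kappa, c1):
    """Time for lambda to grow from lam0 to lam1 when d(lambda)/d(alpha) = c1 * lambda * sin(kappa).

    Solution: lambda(alpha) = lam0 * exp(c1 * sin(kappa) * alpha).
    """
    return math.log(lam1 / lam0) / (c1 * math.sin(kappa))


def ngd_escape_time(lam0, lam1, kappa, c2):
    """Same quantity when d(lambda)/d(alpha) = c2 * tan(kappa / 2) / lambda.

    Solution: lambda(alpha)^2 = lam0^2 + 2 * c2 * tan(kappa / 2) * alpha.
    """
    return (lam1 ** 2 - lam0 ** 2) / (2.0 * c2 * math.tan(kappa / 2.0))


if __name__ == "__main__":
    kappa = math.acos(0.75)
    for lam0 in (1e-2, 1e-4, 1e-8):
        print(lam0,
              gd_escape_time(lam0, 0.1, kappa, c1=1.0),
              ngd_escape_time(lam0, 0.1, kappa, c2=1.0))
```

Under these assumptions the GD escape time grows logarithmically without bound as the starting angle $\lambda_0$ shrinks, whereas the NGD escape time stays finite even for $\lambda_0 \to 0$, which mirrors the repulsion from the saddle described above.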
VI. CONCLUSION

We have studied the on-line learning of soft committee
machines under correlated teacher hidden units. The
plateau in GD is greatly prolonged, at about $O(\kappa^{-\ldots})$, as
the correlation of the teacher weight vectors rises, but almost
no plateau occurs in NGD with a low learning rate
$\eta$, and this does not depend on the correlation. Our analytical
results around the saddle reveal that NGD
avoids the saddle, even though the strong correlation
of the teacher weight vectors forces the student weight
vectors close to the saddle, where the Fisher information
matrix is singular.